Abstract



Data management systems, such as database, information extraction,
information retrieval or learning systems, store, organize, index,
retrieve and rank information units, such as tuples, objects, documents
and items, to match a pattern (e.g., classes and profiles) or meet a
requirement (e.g., relevance, usefulness and utility). To this end, these
systems rank information units by probability to decide whether an
information unit matches a pattern or meets a requirement. Classical
probability theory represents events as sets and probability as a set
measure; thus, the distributive and total probability laws are admitted.
Quantum probability is a non-classical theory and admits neither the
distributive law nor the law of total probability. Although ranking by
probability is far from being perfect, it is optimal thanks to
statistical decision theory and parameter tuning.

The main question asked in the paper is whether further improvement
over the optimality provided by probability may be obtained if classical
probability theory is replaced by quantum probability theory. Whereas
classical probability (and detection theory) is based on sets, so that
the regions of acceptance and rejection are set-based detectors, quantum
probability is based on subspace-based detectors.

The paper shows that ranking information units by quantum probability
differs from ranking them by classical probability given the same data
for parameter estimation. As the probability of detection (also known as
recall or power) and the probability of false alarm (also known as
fallout or size) measure the quality of ranking, we point out and show
that ranking by quantum probability yields a higher probability of
detection than ranking by classical probability at a given probability
of false alarm and with the same parameter estimation data.

As quantum probability has provided more effective detectors than
classical probability in domains other than data management, we
conjecture that a system that implements subspace-based detectors shall
be more effective than a system that implements set-based detectors, the
effectiveness being calculated as expected recall estimated over the
probability of detection and expected fallout estimated over the
probability of false alarm.
1 Introduction



Data management systems, such as database, information retrieval (IR),
information extraction (IE) or learning systems, store, organize, index,
retrieve and rank information units, such as tuples, objects, documents
and items. A wide range of applications of these systems have emerged
that require the management of uncertain or imprecise data. Important
examples of such data are sensor data, webpages, newswires and imprecise
attribute values. What is common to all these applications is
uncertainty and, consequently, the need to deal with decision and
statistical inference.

Ranking is perhaps the most crucial task performed by the data manage- 
ment systems which have to deal with uncertainty. In many applications, 
ranking aims at deciding or inferring, for example, the class assigned to a 
unit or the order by relevance, usefulness, or utility of the units delivered to 
another application or to an end user. In addition, ranking is performed to 
decide whether a unit is placed at a given rank. 

The management of imprecise data requires means for ranking information
units by probability. Ranking places information units in a list ordered
by a measure of utility, cost, relevance, etc. A probability theory
measures the uncertainty of the decision. To this end, the definition of
an event space and the estimation of probabilities are necessary steps
for representing imprecise data and making predictions within many
contexts of data management, such as machine learning, information
retrieval or probabilistic databases.

The measurement of the imprecision and the uncertainty in the data leads 
to the definition of regions of acceptance of a predefined set of hypotheses, 
thus bringing many decision problems to the calculation of a probability of 
detection and of a probability of false alarm. Although the data manage- 
ment systems reach good results thanks to classical probability theory and 
parameter tuning, ranking is far from being perfect because useless units are 
often ranked at the top or useful units are missed. 

Classical probability theory describes events and probability
distributions using sets and set measures, respectively, according to
Kolmogorov's axioms. In contrast, quantum probability theory describes
events and probability distributions using Hermitian operators in the
complex Hilbert vector space. Whereas parameter tuning is performed
within a fixed probability theory, the adoption of quantum probability
entails a radical change. Furthermore, whereas classical probability is
based on sets, so that the regions of acceptance or rejection are
set-based detectors (i.e., indicator functions), quantum probability is
based on subspaces, so that the detectors are projector-based. Note that
the use of quantum probability does not imply that quantum phenomena are
investigated in the paper; we are interested in the formalism based on
Hilbert vector spaces instead.

The main question asked in the paper is whether further improvement 
may be obtained if the classical probability theory is replaced by the quan- 
tum probability theory. The paper shows that ranking information units by 
quantum probability yields different outcomes which are in principle more ef- 
fective than ranking them by classical probability given the same data avail- 
able for parameter estimation. The effectiveness is measured in terms of 
probability of detection (also known as recall or power) and probability of 
false alarm (also known as fallout or size). 

We structure the paper as follows. Section 2 illustrates the basics of
probability theory through a view that encompasses both theories.
Section 3 compares quantum detection with classical detection. Section 4
shows that ranking by quantum probability is more effective than ranking
by classical probability. Section 5 provides an interpretation of the
projectors which define the regions of acceptance and rejection.
Section 6 describes the algorithm for ranking information units by
quantum probability. Section 7 provides an overview of the related work.

2 Classical Probability and Quantum Probability

In this section, we introduce a special view of probability distributions for 
the classical theory of probability. The same view is also introduced for 
quantum probability, which is a non-classical theory and does not admit the 
distributive law, to provide a general framework for quantum and classical 
probabilities; the view is depicted in Figure 1.

Before introducing the view of probability theory, some basic
definitions are provided. A probability space is a set of mutually
exclusive events such that each event is assigned a probability between
0 and 1 and the sum of the probabilities over the set of events is 1.
For the sake of clarity, we introduce the case of binary event spaces
because it is the simplest and most common in data management: keyword
occurrence in webpages, binary features in sample records or binary
attribute values in relational tables are some examples. The events of a
binary event space are usually represented by mutually exclusive scalars
such as 0 and 1. If binary scalars are used, the mutual exclusiveness is
given by the scalar product, for example, $0 \cdot 1 = 0$ (see [1]).

Whereas the scalars {0, 1} are a possible representation of events, the
vectors of a complex finite-dimensional vector space are another option.
When using vectors, an event is (0,1)' and its complement is (1,0)'. The
representation of the events must encode the mutual exclusiveness. If
binary vectors are used, the mutual exclusiveness is given by the inner
product, i.e., (0,1) (1,0)' = 0.

When the event space is not binary (e.g., when the events are
represented by k natural numbers 0, 1, ..., k - 1), a binary
representation can again be used. The vector (0, ..., 0, 1)' is assigned
to symbol 0, the vector (0, ..., 1, 0)' is assigned to symbol 1, and so
on until (1, ..., 0, 0)' is assigned to symbol k - 1. Whatever
representation is used, the inner product between two distinct vectors
must be 0 and their norm must be 1.
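
The encoding of a k-event space can be sketched in a few lines of Python. The helper names `one_hot` and `inner` are hypothetical and do not come from the paper; the sketch only checks that the vectors assigned to distinct symbols are orthogonal and that each has unit norm.

```python
def one_hot(symbol, k):
    """Return the binary vector assigned to `symbol` in a k-event space.

    Following the text, symbol 0 maps to (0, ..., 0, 1) and symbol k-1
    maps to (1, 0, ..., 0).
    """
    if not 0 <= symbol < k:
        raise ValueError("symbol must be in 0..k-1")
    vec = [0] * k
    vec[k - 1 - symbol] = 1
    return vec


def inner(u, v):
    """Inner product of two real vectors."""
    return sum(a * b for a, b in zip(u, v))


k = 4  # illustrative size of the event space
vectors = [one_hot(s, k) for s in range(k)]
# Gram matrix: 1 on the diagonal (unit norm), 0 elsewhere (exclusiveness).
gram = [[inner(u, v) for v in vectors] for u in vectors]
```

The Gram matrix equals the identity exactly when the representation encodes mutual exclusiveness and normalization.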

The mapping between the events and the probabilities is called
"probability distribution", which is a function mapping a mathematical
object which represents an event to a real number ranging between 0 and
1. Classical probability and quantum probability differ in the way the
event space and the probability distribution are represented.

The starting point of the view of probability used in the paper is the
algebraic form of the probability space. To this end, Hermitian (or
self-adjoint) linear operators are used. In quantum mechanics,
"operator" is preferred to "matrix"; in the paper, for the sake of
clarity, "matrix" is preferred because, for a fixed basis, the matrices
are isomorphic to the operators. A matrix is Hermitian when it is equal
to its conjugate transpose. Hermitian matrices are important because
their eigenvalues are always real. In particular, Hermitian matrices
with trace 1 are the key notion in quantum probability because the sum
of their eigenvalues is 1 and, thus, the eigenvalues, when non-negative,
can be viewed as a probability distribution.

The projector is an idempotent Hermitian matrix. Every subspace has one
projector, hence the projectors are in 1:1 correspondence with the
subspaces. Each unit vector corresponds to one projector with rank one
defined as the outer product of the vector by its conjugate transpose.
There are two main requirements for representing events using
projectors:

• the projectors must be mutually orthogonal for representing the mutual
exclusiveness of the events, and

• the projectors must have trace 1 for making probability calculation
consistent with the probability axioms.

Footnote 1: "Symmetric" is adopted instead of "Hermitian" in the real
field.

An event space and a probability function defined over it are
represented using Hermitian matrices with trace 1. In particular, a
projector represents an event and an event space is modeled by a
collection of projectors. As the union of the events results in the
whole event space, the sum of the projectors of a collection
corresponding to an event space results in the identity matrix. More
specifically, if $\{E_0, \ldots, E_{k-1}\}$ is a collection of mutually
orthogonal projectors,

$$E_0 + \cdots + E_{k-1} = I ,$$

the latter being termed "resolution to the unity". For example, using
the Dirac notation introduced in Appendix A, the projectors of two
events are represented by

$$|0\rangle\langle 0| = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} \tag{1}$$

and

$$|1\rangle\langle 1| = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \tag{2}$$

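However, there is not a unique representation of an event space. For
example, the following vectors also represent mutually exclusive events:

$$\frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix} \qquad \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix} \tag{3}$$

thus leading to a different resolution to the unity given by the
following projectors

$$\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix} \qquad \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 1/2 \end{pmatrix} \tag{4}$$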
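
The two requirements on projectors, together with the resolution to the unity, can be checked numerically. The following is a minimal sketch, assuming real vectors and taking (1,1)'/√2 and (1,-1)'/√2 as the second basis; the function names are hypothetical and are not part of the paper.

```python
import math


def outer(u, v):
    """Outer product u v' of two real vectors."""
    return [[a * b for b in v] for a in u]


def matmul(a, b):
    """Product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def close(a, b, tol=1e-12):
    """Element-wise comparison of two matrices up to a tolerance."""
    return all(abs(x - y) <= tol
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def is_resolution_of_unity(vectors):
    """Check idempotence, mutual orthogonality and sum-to-identity."""
    n = len(vectors[0])
    projectors = [outer(v, v) for v in vectors]
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    zero = [[0.0] * n for _ in range(n)]
    total = zero
    for i, p in enumerate(projectors):
        if not close(matmul(p, p), p):
            return False
        for q in projectors[i + 1:]:
            if not close(matmul(p, q), zero):
                return False
        total = [[x + y for x, y in zip(rt, rp)] for rt, rp in zip(total, p)]
    return close(total, identity)


s = 1 / math.sqrt(2)
standard_basis = [[0.0, 1.0], [1.0, 0.0]]
rotated_basis = [[s, s], [s, -s]]   # assumed example of a second basis
```

Both bases pass the check, whereas a pair of non-orthogonal unit vectors such as (1,0)' and (1,1)'/√2 does not.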



The second kind of Hermitian matrix of a probability space is the
density matrix; the density matrix encapsulates the probability values
assigned to the events. In Physics, a density matrix represents the
state of a microscopic system, such as a particle or a photon. The
structure of a microscopic system is unknown, yet a device can measure
the system to obtain some information. A microscopic system is similar
to an urn of colored balls. The internal composition of the urn is
always unknown, even if the urn is opened and observed, because the
device disturbs the state (i.e., the distribution of the colors) of the
urn.

In data management and in other domains different from Particle
Physics, a system is macroscopic instead. Examples of macroscopic
systems in data management are webpages, customers, queries, clicks,
tuples, attributes, and so on. The states of these systems correspond to
the probability densities according to which keywords, reviews and
attribute values are observables to be measured from such systems.
Density matrices are a powerful formalism in the macroscopic world too
because they allow us to introduce the algebraic approach adopted for
incorporating the more powerful probability space and decision rule
suggested in the paper.

To introduce the way density matrices are defined, consider two
equiprobable events, e.g., the occurrence of a feature or a
positive/negative customer review. The probability distribution is
$(\frac{1}{2}, \frac{1}{2})$ where each value refers to an event. As an
alternative to a list, the probability distribution can be arranged
along the diagonal of a two-dimensional matrix whose other elements are
zeros. For example, the matrix corresponding to the probability
distribution of two equally probable events is

$$\mu = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$$

In general, the probability distribution $(p_1, \ldots, p_k)$ of a
k-event space can be written as

$$\begin{pmatrix} p_1 & 0 & \cdots & 0 \\ 0 & p_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_k \end{pmatrix}$$

A probability distribution is pure when the density matrix is a
projector; otherwise, the distribution is mixed. A distribution is mixed
when the density matrix is a mixture of density matrices; a pure
distribution is an instance of mixture with one matrix. The density
matrix representing a pure distribution is in 1:1 correspondence with a
density vector such that the projector is the outer product between the
vector and its conjugate transpose. A classical probability distribution
is pure when the probability is concentrated on a single elementary
event, which is the certain event and then has probability 1.

Footnote 2: A macroscopic system in data management is thus not a
computer system as far as the paper is concerned.

Given a density matrix, the spectral theorem helps find the underlying
events and the related probabilities. Because of the importance of the
spectral theorem, we provide its statement below:

Theorem 1 To every Hermitian matrix A on a finite-dimensional complex
inner product space there correspond real numbers $\alpha_0, \ldots,
\alpha_{r-1}$ and rank-one projectors $E_0, \ldots, E_{r-1}$ so that the
$\alpha_j$'s are pairwise distinct, the $E_j$ are mutually orthogonal,
$\sum_{j=0}^{r-1} E_j = I$, $\sum_{j=0}^{r-1} \alpha_j = 1$ (when A has
trace 1) and $\sum_{j=0}^{r-1} \alpha_j E_j = A$.

Proof See [10, page 156]. ■

The eigenvalues are the spectrum and $E_0, \ldots, E_{r-1}$ are the
projectors of the spectrum of A. From Theorem 1, thus, a pure
distribution is always a rank-one projector.

The spectral theorem says that any Hermitian matrix corresponding to a
distribution can be decomposed as a linear combination of projectors
(i.e., pure distributions) where the eigenvalues are the probability
values associated with the events represented by the projectors. The
eigenvalues are real because the decomposed matrix is Hermitian, are
non-negative and sum to 1. For example, when the matrix corresponding to
the distribution of two equally probable events is considered, the
spectral theorem says that

$$\begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix} = \frac{1}{2}\,|0\rangle\langle 0| + \frac{1}{2}\,|1\rangle\langle 1|$$

A mixed distribution has more than one non-zero eigenvalue, whereas a
pure distribution has a single non-zero eigenvalue, equal to 1.
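
For two-dimensional real symmetric density matrices the spectral decomposition has a closed form, which makes the distinction between mixed and pure distributions easy to check. The sketch below is only illustrative: `spectral_2x2` and `rebuild` are hypothetical helpers, and the pure density built from p = 0.25 is an assumed example.

```python
import math


def spectral_2x2(m):
    """Eigenvalues and unit eigenvectors of a real symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    if b == 0:
        # Already diagonal: the standard basis vectors are eigenvectors.
        return [(a, [1.0, 0.0]), (d, [0.0, 1.0])]
    mean, radius = (a + d) / 2, math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        v = [lam - d, b]            # solves (m - lam I) v = 0 when b != 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, [v[0] / norm, v[1] / norm]))
    return pairs


def rebuild(pairs):
    """Sum of eigenvalue times rank-one projector (the spectral form)."""
    return [[sum(lam * v[i] * v[j] for lam, v in pairs) for j in range(2)]
            for i in range(2)]


mixed = [[0.5, 0.0], [0.0, 0.5]]
p = 0.25                                   # illustrative value
c = math.sqrt(p * (1 - p))
pure = [[p, c], [c, 1 - p]]

mixed_spectrum = [lam for lam, _ in spectral_2x2(mixed)]
pure_spectrum = [lam for lam, _ in spectral_2x2(pure)]
```

The mixed matrix has two non-zero eigenvalues, while the non-diagonal pure matrix has eigenvalues 1 and 0 up to rounding.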

In classical probability, every pure distribution is represented by a
diagonal density matrix which is itself a projector. However, in
general, a density matrix is not necessarily diagonal, yet the matrix is
necessarily Hermitian. For example, the matrices in (4) are trace-one
projectors and correspond to pure distributions; thus there is a certain
event (with probability 1) and an impossible event (with probability 0,
of course). Yet, they are not diagonal. When, for example, keyword
occurrence in webpages is represented, the first projector may be
assigned eigenvalue 1 and the other is assigned eigenvalue 0. Thus the
former represents the certain event and the latter represents the
impossible event in the probability space.

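When using the algebraic form to represent probability spaces, the
function for computing a probability is the trace of the matrix obtained
by multiplying the density matrix by the projector corresponding to the
event. The usual notation for the probability of the event represented
by projector E when the distribution is represented by density matrix
$\rho$ is

$$\mathrm{tr}(\rho E) \tag{5}$$

also known as Born's rule [23]. For example, when $\rho = \mu$ is the
density matrix of two equally probable events and
$E = |1\rangle\langle 1|$, the probability is

$$\mathrm{tr}(\mu E) = \mathrm{tr}\left[\begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}\right] = \frac{1}{2}$$

When $E = |x\rangle\langle x|$ is a rank-one projector, the trace-based
probability function can be written as

$$\mathrm{tr}(\rho E) = \langle x|\rho|x\rangle$$

When $\rho$ is a rank-one projector $|y\rangle\langle y|$, then

$$\mathrm{tr}(\rho E) = |\langle x|y\rangle|^2$$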
From the example, defining a function that computes the probability of
an event may seem odd when the probability is already allocated in the
diagonal of the density matrix. However, we have shown that not all the
density matrices corresponding to a distribution need to be diagonal
matrices and the diagonal elements do not necessarily correspond to
probability values, although they do have to sum to 1.
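
The three equivalent forms of the trace-based probability can be compared numerically. This is a minimal sketch with arbitrary complex unit vectors chosen for illustration; the helper names are hypothetical and not from the paper.

```python
import cmath
import math


def ket_bra(v):
    """Rank-one projector |v><v| of a unit complex vector."""
    return [[a * b.conjugate() for b in v] for a in v]


def trace_of_product(a, b):
    """tr(ab) computed without forming the product matrix."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def braket(x, y):
    """Inner product <x|y>, conjugate-linear in the first argument."""
    return sum(a.conjugate() * b for a, b in zip(x, y))


# Illustrative unit vectors; the numbers are arbitrary.
y = [complex(math.sqrt(0.3)), math.sqrt(0.7) * cmath.exp(0.4j)]
x = [complex(1 / math.sqrt(2)), 1j / math.sqrt(2)]

rho, event = ket_bra(y), ket_bra(x)
by_trace = trace_of_product(rho, event).real          # tr(rho E)
by_sandwich = sum(x[i].conjugate() * rho[i][j] * x[j]
                  for i in range(2) for j in range(2)).real   # <x|rho|x>
by_overlap = abs(braket(x, y)) ** 2                   # |<x|y>|^2
```

All three quantities agree up to rounding, even though neither `rho` nor `event` is diagonal.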

A density matrix encapsulates the values assigned to the events by a
probability function because of Gleason's theorem, stated below and
proved in [12].

Theorem 2 To every probability distribution on the set of all
projectors in a complex vector space with dimension greater than 2 there
corresponds a unique density matrix $\rho$ on the same vector space for
which the probability of the event represented by a projector
$|x\rangle\langle x|$ is $\mathrm{tr}(\rho\,|x\rangle\langle x|)$ for
every unit vector x in the vector space.

Basically, the theorem tells us that to every probability distribution
there corresponds one density matrix such that the probability of any
event represented as a projector is calculated by the trace function.

The probability of an event computed using a mixture differs from the
probability computed using a pure state (i.e., a superposition); the two
share a classical probability term and their difference is called
interference term. Let $|v\rangle = a_0|0\rangle + a_1|1\rangle$ and
$|x\rangle = b_0|0\rangle + b_1|1\rangle$ be unit vectors. Using a
mixture,

$$\mathrm{tr}(\mu\,|x\rangle\langle x|) = |a_0|^2|b_0|^2 + |a_1|^2|b_1|^2 \qquad \mu = |a_0|^2\,|0\rangle\langle 0| + |a_1|^2\,|1\rangle\langle 1| \tag{6}$$

Using superposition,

$$\mathrm{tr}(\rho\,|x\rangle\langle x|) = |a_0|^2|b_0|^2 + |a_1|^2|b_1|^2 + 2|a_0||b_0||a_1||b_1|\cos\theta \tag{7}$$

where $\rho = |v\rangle\langle v|$ and $\theta$ is the angle of the
polar representation of the complex number
$a_0 \bar{b}_0 \bar{a}_1 b_1$. Suppose, as an example, that $|x\rangle$
represents the event "the keyword occurs" and the density matrix
represents the probability distribution of keyword occurrence in useful
webpages. The common term (i.e., (6)) is the sum of two probabilities:
the probability that the webpage is not useful ($|a_0|^2$) multiplied by
the probability that the keyword occurs in a useless webpage
($|b_0|^2$), and the probability that the webpage is useful ($|a_1|^2$)
multiplied by the probability that the keyword occurs in a useful
webpage ($|b_1|^2$). The sum is nothing but an application of the law of
total probability.

The quantity $2|a_0||b_0||a_1||b_1|\cos\theta$ is the interference term.
As the interference term ranges between -1 and +1, the probability of
keyword occurrence computed when usefulness is superposed with
uselessness differs from the common term, in which usefulness and
uselessness are mutually exclusive and their probability distribution is
described by a mixture. The interference term can be so large that the
law of total probability is violated and no probability space obeying
Kolmogorov's axioms can admit the probability values $|a_1|^2$ and
$|b_1|^2$, thus requiring the adoption of a quantum probability space
[21 E].
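
The relation between the mixture, the superposition and the interference term can be verified numerically. The sketch below assumes |v> = a0|0> + a1|1> and |x> = b0|0> + b1|1> with illustrative amplitudes; the function names are hypothetical.

```python
import cmath
import math


def mixture_probability(a, b):
    """Classical term |a0|^2 |b0|^2 + |a1|^2 |b1|^2 (total probability)."""
    return sum(abs(ai) ** 2 * abs(bi) ** 2 for ai, bi in zip(a, b))


def interference(a, b):
    """2 |a0||b0||a1||b1| cos(theta), theta = arg(a0 conj(b0) conj(a1) b1)."""
    z = a[0] * b[0].conjugate() * a[1].conjugate() * b[1]
    return 2 * abs(z) * math.cos(cmath.phase(z))


def superposition_probability(a, b):
    """|<x|v>|^2 with |v> = a0|0> + a1|1> and |x> = b0|0> + b1|1>."""
    return abs(sum(bi.conjugate() * ai for ai, bi in zip(a, b))) ** 2


# Illustrative amplitudes (assumed); both vectors have unit norm.
a = [complex(math.sqrt(0.4)), complex(math.sqrt(0.6))]
b = [complex(math.sqrt(0.5)), math.sqrt(0.5) * cmath.exp(2.0j)]

classical = mixture_probability(a, b)
quantum = superposition_probability(a, b)
gap = interference(a, b)
```

With these amplitudes the relative phase makes the interference term negative, so the superposed probability falls below the value given by the law of total probability.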
3 Quantum Probability and Decision



In general, the information stored in the data is acquired and
delivered through information unit representation and ranking. These
processes are described in terms of decision and estimation, and they
are therefore affected by error. The error could be eliminated only if
precise and exhaustive methodological tools and computer systems were
developed. Nevertheless, there is a trade-off between precision,
exhaustivity and computation cost because a high level of the former two
can be achieved only at a high computation cost. Thus, a certain amount
of error is unavoidable, yet it can be controlled and limited below a
given threshold.

Either a set of statements, or hypotheses, must be decided upon to best
describe the information unit insofar as the data permit us to judge
(e.g., the best topic(s) to which a webpage is assigned), or the values
of certain quantities (also known as parameters) characterizing the
information unit must be estimated. In the former case, the probability
of detection and the probability of false alarm related to a decision
must be calculated. In the paper, a great deal of attention is paid to
decision whereas estimation is set aside, not because estimation is of
little importance, but because addressing it to the appropriate level of
exhaustivity would require another research stream.

Many tasks in data management are decision problems. Examples are the
classification of images with respect to predefined patterns, the
categorization of webpages into topics, contextual advertising (i.e.,
the decision whether an ad has to be displayed in a search engine result
page), the retrieval and ranking of webpages (i.e., the decision as to
whether a webpage has to be put at rank r of a search engine result
page), and probabilistic databases (i.e., the decision about the correct
value of an attribute and then the computation of the associated
probability).

Our illustration of decision theory is necessarily brief and confined to its 
simplest aspects and examples. The illustration is also organized in such a 
way as to bring out most clearly the parallels between classical probability- 
based decision and quantum probability-based decision. The examples are 
chosen from elementary information retrieval or machine learning theory and 
perhaps provide a basis for comparison with the quantum case. 

A certain information unit (e.g., a webpage or a store item) is
observed in such a way as to obtain numbers (e.g., the PageRank or the
number of positive reviews) on the basis of which a decision has to be
made about its state. The numbers observed are, for example, the
frequency of a feature in the information unit, the simplest example
being the frequency of a keyword in a webpage used for calculating
search engine statistical ranking functions. For the sake of clarity, we
use the binary frequency and the feature presence/absence case in the
paper. The state might be, for example, the relevance of the webpage to
the search engine user's interests or the customer's willingness to buy
the store item. The use of the term "state" is not coincidental because
the numbers are observed depending upon the density matrix, which is
indeed the mathematical notion implementing the state of a system. Thus,
quantum probability reduces the decision about the state of an
information unit to testing the hypothesis that a given density matrix
has generated the observed numbers.

Consider the hypothesis that the state of the system is the density
matrix $\rho_1$ and the alternative hypothesis that the state of the
system is the density matrix $\rho_0$. The two hypotheses can be labeled
$H_1$ and $H_0$, respectively. In data management, hypothesis $H_0$
asserts, for example, that a customer does not buy an item or that a
webpage shall be irrelevant to the search engine user, whereas
hypothesis $H_1$ asserts that an item shall be bought by a customer or
that a webpage shall be relevant to the user. Therefore, the probability
that, say, a feature occurs in an item which shall not be bought by a
customer or a keyword occurs in a webpage which shall be irrelevant to
the search engine user depends on the state (i.e., the density matrix).

Statistical decision theory is an old topic and Neyman-Pearson's lemma
is by now one of its most important results; it provides a criterion for
deciding upon hypotheses as an alternative to the Bayesian approach. The
lemma provides the rule to govern the decider's behaviour and decide the
true hypothesis without hoping to know whether it is true. Given an
information unit and a hypothesis about the unit, such a rule calculates
a specified number (e.g., a feature) and, if the number is greater than
a threshold, rejects the hypothesis; otherwise, it accepts it. Such a
rule tells nothing about whether, say, the item shall be bought by the
customer, but the lemma proves that, if the rule is followed, then, in
the long run, the hypothesis shall be accepted at the highest
probability of detection (or power) possible when the probability of
false alarm (or size) [12] is not higher than a threshold. The set of
the pairs given by size and power is the power curve, which is also
known as the Receiver Operating Characteristic (ROC) curve.

Neyman-Pearson's lemma implies that the set of the observable numbers
(e.g., features) can be partitioned into two distinct regions; one
region includes all the numbers for which the hypothesis shall be
accepted and is termed acceptance region, the other region includes all
the numbers for which the hypothesis shall be rejected and is termed
rejection region. For example, if a keyword is observed from webpages
and only presence/absence is observed, the set of the observable numbers
is {0, 1} and each region is one of the possible subsets, i.e.,
∅, {0}, {1}, {0, 1}.
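
For a single binary feature the candidate regions can be enumerated exhaustively, together with the size and the power each of them would yield. The sketch below is illustrative only: the values of `p0` and `p1` are assumptions and the function names are hypothetical.

```python
from itertools import combinations


def all_regions(outcomes=(0, 1)):
    """Every subset of the observable values, as a tuple."""
    return [subset for r in range(len(outcomes) + 1)
            for subset in combinations(outcomes, r)]


def size_and_power(region, p0, p1):
    """Probability of landing in `region` under H0 (size) and H1 (power).

    p0 and p1 are the probabilities that the feature equals 1 under
    H0 and H1, respectively.
    """
    def mass(p):
        return sum(p if value == 1 else 1 - p for value in region)
    return mass(p0), mass(p1)


p0, p1 = 0.2, 0.6    # illustrative values, not taken from the paper
table = {region: size_and_power(region, p0, p1) for region in all_regions()}
```

Each entry of `table` is one point of the power curve that a set-based detector can reach without randomization.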

The paper reformulates Neyman-Pearson's lemma in terms of subspaces
instead of subsets to utilize quantum probability. Therefore, the region
of acceptance and the region of rejection must be defined in terms of
subspaces. In the following, we illustrate the algorithm for calculating
the most efficient test in Hilbert spaces. The following result holds:

Theorem 3 Let $\rho_1, \rho_0$ be the density matrices under
$H_1, H_0$, respectively. The region of acceptance at the highest power
at every size is given by the projectors of the spectrum of

$$\rho_1 - \lambda\rho_0 \qquad \lambda > 0 \tag{8}$$

whose eigenvalues are positive.

Proof See [H]. ■ 



Definition 1 An optimal projector is a projector which identifies the
region of acceptance and the region of rejection according to Theorem 3.

Definition 2 We define the discriminant function as

$$\mathrm{tr}((\rho_1 - \lambda\rho_0)E) \tag{9}$$

where E is a projector. If the discriminant function is positive, the
observed event represented by E is placed in the region of acceptance.
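
The discriminant function can be evaluated directly from its definition. The following sketch assumes, for illustration only, diagonal densities diag(p, 1 - p) with p = 0.6 under H1 and p = 0.2 under H0, and two candidate projectors for feature presence and absence; all names and values are hypothetical.

```python
def discriminant(rho1, rho0, lam, projector):
    """tr((rho1 - lam * rho0) E) for square matrices as nested lists."""
    n = len(projector)
    diff = [[rho1[i][j] - lam * rho0[i][j] for j in range(n)]
            for i in range(n)]
    return sum(diff[i][j] * projector[j][i]
               for i in range(n) for j in range(n))


def accepted(rho1, rho0, lam, candidates):
    """Names of the candidate projectors with a positive discriminant."""
    return [name for name, e in candidates.items()
            if discriminant(rho1, rho0, lam, e) > 0]


# Assumed example: diagonal densities with the feature-present
# probability in the top-left entry.
density_h1 = [[0.6, 0.0], [0.0, 0.4]]
density_h0 = [[0.2, 0.0], [0.0, 0.8]]
candidates = {
    "present": [[1.0, 0.0], [0.0, 0.0]],
    "absent": [[0.0, 0.0], [0.0, 1.0]],
}

strict = accepted(density_h1, density_h0, 1.0, candidates)
lenient = accepted(density_h1, density_h0, 0.4, candidates)
```

Lowering the threshold `lam` enlarges the region of acceptance, which is how different points of the power curve are reached.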

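Suppose that the density matrix that corresponds to $H_1$ is a mixed,
classical probability distribution. The mixed case is the usual method
for dealing with uncertainty in data management, even though more than
one feature may exist or the feature may not be binary; however, the
number of features or the number of values of a feature is not essential
in the paper. Let $\mu_1$ be such a mixed distribution and

$$\mu_1 = p_1 P_1 + (1 - p_1) P_0 = \begin{pmatrix} p_1 & 0 \\ 0 & 1 - p_1 \end{pmatrix} \tag{10}$$

where

$$P_0 = |0\rangle\langle 0| \qquad P_1 = |1\rangle\langle 1| \tag{11}$$

$$\mu_0 = p_0 P_1 + (1 - p_0) P_0 = \begin{pmatrix} p_0 & 0 \\ 0 & 1 - p_0 \end{pmatrix} \tag{12}$$

When $P_1$ is observed, the power and the size are, respectively,

$$P_d = \mathrm{tr}(\mu_1 P_1) = p_1 \qquad P_0 = \mathrm{tr}(\mu_0 P_1) = p_0 \tag{13}$$

(on the left-hand side of the second equality, $P_0$ denotes the size,
not the projector).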



In the classical case, $P_0, P_1$ represent the absence and the
presence, respectively, of a feature. Hence, the possible acceptance or
rejection regions are $0, P_0, P_1, I$, which correspond respectively to
"never accept", "accept when the feature does not occur", "accept when
the feature occurs" and "always accept". Thus, the decision on, say,
webpage classification, topic categorization or item suggestion can be
made upon the occurrence of one or more features because $P_0, P_1$
represent "physical" events. Furthermore, the discriminant function in
the mixed case is

$$\mathrm{tr}((\mu_1 - \lambda\mu_0)E) \qquad E \in \{0, P_0, P_1, I\} \tag{14}$$

The power curve can be built as follows. Suppose, as an example, that a
keyword describes webpage content and that a webpage either includes
($P_1 = |1\rangle\langle 1|$) or does not include
($P_0 = |0\rangle\langle 0|$) the keyword. The size is 0 only if the
power is 0 and this point corresponds to the event represented by the
null projector 0. Let $p_1, p_0$ be the probability that the keyword
occurs in a relevant webpage or in a non-relevant webpage, respectively.
When the keyword is the unique observed feature, the webpage is
presented to the user if $p_1 > \lambda p_0$ and the keyword occurs, or
$p_1 < \lambda p_0 + (1 - \lambda)$ and the keyword does not occur. The
power curve includes the points $(p_0, p_1)$ and $(1 - p_0, 1 - p_1)$.

The key point is that a mixture is not the unique way to implement the
probability distributions. As we illustrate in Section 2, the superposed
vectors

$$|\varphi_1\rangle = \sqrt{p_1}\,|1\rangle + \sqrt{1 - p_1}\,|0\rangle \qquad |\varphi_0\rangle = \sqrt{p_0}\,|1\rangle + \sqrt{1 - p_0}\,|0\rangle \tag{15}$$

yield the pure densities

$$\rho_1 = \begin{pmatrix} p_1 & \sqrt{p_1(1 - p_1)} \\ \sqrt{p_1(1 - p_1)} & 1 - p_1 \end{pmatrix} \tag{16}$$

$$\rho_0 = \begin{pmatrix} p_0 & \sqrt{p_0(1 - p_0)} \\ \sqrt{p_0(1 - p_0)} & 1 - p_0 \end{pmatrix} \tag{17}$$

which replace the mixed densities. Theorem 3 instructs us to define the
optimal projectors as those of the spectrum of (8) whose eigenvalues are
positive, the spectrum being

$$\eta_0 Q_0 + \eta_1 Q_1 \qquad Q_0 = |\eta_0\rangle\langle\eta_0| \qquad Q_1 = |\eta_1\rangle\langle\eta_1| \tag{18}$$

where the $\eta$'s are eigenvalues,

$$\eta_0 = -R + \tfrac{1}{2}(1 - \lambda) < 0 \qquad \eta_1 = +R + \tfrac{1}{2}(1 - \lambda) > 0 \tag{19}$$

and

$$R = \sqrt{\tfrac{1}{4}(1 - \lambda)^2 + \lambda(1 - |X|^2)} \tag{20}$$

$$|X|^2 = \left(\sqrt{p_0}\sqrt{p_1} + \sqrt{1 - p_0}\sqrt{1 - p_1}\right)^2 \tag{21}$$

$$\phantom{|X|^2} = |\langle\varphi_0|\varphi_1\rangle|^2 \tag{22}$$

(see [H]). $|X|^2$ is the distance between densities defined in [21];
$|X|^2$ is the squared cosine of the angle between the subspaces
corresponding to the density vectors. The justification of viewing
$|X|^2$ as a distance comes from the fact that "the angle in a Hilbert
space is the only measure between subspaces, up to a constant factor,
which is invariant under all unitary transformations, that is, under all
possible time evolutions."
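
The closed-form eigenvalues can be cross-checked against the trace and the determinant of the matrix obtained by subtracting the weighted pure density under H0 from the pure density under H1. The sketch below assumes real density vectors (sqrt(p), sqrt(1 - p))' and illustrative values of `p0`, `p1` and `lam`; the function names are hypothetical.

```python
import math


def overlap(p0, p1):
    """X = sqrt(p0 p1) + sqrt((1 - p0)(1 - p1))."""
    return math.sqrt(p0 * p1) + math.sqrt((1 - p0) * (1 - p1))


def closed_form_eigenvalues(p0, p1, lam):
    """(eta0, eta1) = (1 - lam)/2 -/+ R, with R depending on |X|^2."""
    x2 = overlap(p0, p1) ** 2
    r = math.sqrt(0.25 * (1 - lam) ** 2 + lam * (1 - x2))
    return -r + 0.5 * (1 - lam), r + 0.5 * (1 - lam)


def pure_density(p):
    """Outer product of (sqrt(p), sqrt(1 - p))' with itself."""
    c = math.sqrt(p * (1 - p))
    return [[p, c], [c, 1 - p]]


def trace_and_determinant(p0, p1, lam):
    """Trace and determinant of rho1 - lam * rho0, computed directly."""
    r1, r0 = pure_density(p1), pure_density(p0)
    d = [[r1[i][j] - lam * r0[i][j] for j in range(2)] for i in range(2)]
    return d[0][0] + d[1][1], d[0][0] * d[1][1] - d[0][1] * d[1][0]


p0, p1, lam = 0.2, 0.6, 0.5     # illustrative values
eta0, eta1 = closed_form_eigenvalues(p0, p1, lam)
trace, det = trace_and_determinant(p0, p1, lam)
```

The sum and the product of the two closed-form eigenvalues match the trace and the determinant, and exactly one eigenvalue is positive, so the region of acceptance is a one-dimensional subspace.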



Definition 3 $Q_0, Q_1$ are the optimal projectors in the pure case and
$P_0, P_1$ are the optimal projectors in the mixed case.

The probability of detection (i.e., the power) $Q_d$ and the
probability of false alarm (i.e., the size) $Q_0$ in the pure case are
defined as follows:

$$Q_d = \mathrm{tr}(\rho_1 Q_1) = \frac{\eta_1 + \lambda(1 - |X|^2)}{2R} \tag{23}$$

$$Q_0 = \mathrm{tr}(\rho_0 Q_1) = \frac{\eta_1 - (1 - |X|^2)}{2R} \tag{24}$$

Finally, $Q_d$ can be defined as a function of $Q_0$:

$$Q_d = \begin{cases} \left(\sqrt{Q_0}\,|X| + \sqrt{1 - Q_0}\,\sqrt{1 - |X|^2}\right)^2 & 0 \le Q_0 \le |X|^2 \\ 1 & |X|^2 < Q_0 \le 1 \end{cases} \tag{25}$$

so that the power curve is obtained.

Expressions (16) and (17) have no counterpart in classical probability
and are among the essential points of the paper because they allow us to
improve ranking while using the same amount of evidence as that used in
the classical probability distributions (10) and (12).
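
The power curve of the pure case can be traced numerically. The sketch below is written under stated assumptions rather than taken from the paper: the size and the power at threshold `lam` are taken to be (eta1 - (1 - |X|^2)) / (2R) and (eta1 + lam (1 - |X|^2)) / (2R), and the curve is taken to be the square of sqrt(Q0) |X| + sqrt((1 - Q0)(1 - |X|^2)) up to Q0 = |X|^2 and 1 beyond. The code checks that the two descriptions agree for a few thresholds; names and values are illustrative.

```python
import math


def operating_point(x2, lam):
    """(size, power) of the optimal projector for threshold lam > 0."""
    r = math.sqrt(0.25 * (1 - lam) ** 2 + lam * (1 - x2))
    eta1 = r + 0.5 * (1 - lam)
    size = (eta1 - (1 - x2)) / (2 * r)
    power = (eta1 + lam * (1 - x2)) / (2 * r)
    return size, power


def quantum_power(size, x2):
    """Power as a function of size for two pure densities, |X|^2 = x2."""
    if size >= x2:
        return 1.0
    return (math.sqrt(size * x2) + math.sqrt((1 - size) * (1 - x2))) ** 2


x2 = 0.5                      # illustrative squared overlap |X|^2 (< 1)
thresholds = (0.25, 0.5, 1.0, 2.0, 4.0)
points = [operating_point(x2, lam) for lam in thresholds]
residuals = [abs(quantum_power(q0, x2) - qd) for q0, qd in points]
```

Increasing the threshold moves the operating point towards smaller sizes along the same curve, which is the usual way of sweeping a power curve.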

At this point, there are three main issues:

• the numerical difference between the classical and the quantum
probabilities of detection at every given probability of false alarm,

• the interpretations of $P_0, P_1$ and $Q_0, Q_1$ and whether the
interpretations can be tied together,

• how the optimal projectors $Q_0, Q_1$ in the pure case can be used for
ranking information units in a data management system.

The issues are addressed in Sections 4, 5 and 6, respectively.

4 Optimal Projectors in the Quantum Space 

The following lemma shows that the power of the decision rule in
quantum probability is greater than, or equal to, the power of the
decision rule in classical probability with the same amount of
information available from the training set to estimate $p_0, p_1$.

Lemma 1 $Q_d \ge P_d$ at every given false alarm probability.
Proof The equality holds only if $P_i = Q_i$, $i = 0, 1$:

$$\mathrm{tr}(\rho_1 P_1) = \begin{pmatrix} 1 & 0 \end{pmatrix} \rho_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} = p_1 = \mathrm{tr}(\mu_1 P_1)$$

$$\mathrm{tr}(\rho_0 P_1) = \begin{pmatrix} 1 & 0 \end{pmatrix} \rho_0 \begin{pmatrix} 1 \\ 0 \end{pmatrix} = p_0 = \mathrm{tr}(\mu_0 P_1)$$

Let $x = P_0$ be a certain false alarm probability and let $Q_d(x)$,
$P_d(x)$ be the real, continuous functions yielding the detection
probabilities at x. $Q_d$ admits the first and the second derivatives in
the range [0, 1]. In particular, $Q_d'' < 0$ in [0, 1]. $P_d$ is a
continuous function. Consider the polynomial $L_0(x)$ of order 1 passing
through the points $(0, 1 - |X|^2)$ and $(p_0, p_1)$ at which $L_0$
intersects $Q_d$. Then, the Lagrange interpolation theorem can be used
so that

$$Q_d(x) - L_0(x) = Q_d''(c)\,x\,(x - p_0)/2$$

the latter being non-negative because $Q_d'' < 0$ and
$0 \le x \le p_0$. The number $c \in [0, p_0]$ exists due to the Rolle
theorem. As $L_0(x) \ge P_d(x)$, $x \in [0, p_0]$, hence
$Q_d(x) \ge P_d(x)$, $x \in [0, p_0]$. Similarly, consider the
polynomials $L_1(x)$ and $L_2(x)$ of order 1 passing through the points
$(p_0, p_1)$, $(1 - p_0, 1 - p_1)$ and $(1 - p_0, 1 - p_1)$, $(1, 1)$ at
which $L_1$ and $L_2$ intersect $Q_d$, respectively. Then, the Lagrange
interpolation theorem can again be used so that

$$Q_d(x) - L_1(x) = Q_d''(c)\,(x - p_0)(x - 1 + p_0)/2$$

$$Q_d(x) - L_2(x) = Q_d''(c)\,(x - 1 + p_0)(x - 1)/2$$

Then, $Q_d(x) \ge P_d(x)$ for all $x \in [0, 1]$. ■

The power $Q_d$ can be plotted against the size $Q_0$, thus producing the power
curve of the classical decision rule and the power curve of the quantum
decision rule. A graphical representation is provided in Figure 2.
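
As a numerical illustration of Lemma 1, the following minimal Python sketch
evaluates a quantum power curve and compares it with a classical ROC curve on
a grid of sizes. It rests on assumptions that are not spelled out in this
section: $|X|^2$ is taken as the squared inner product of density vectors with
coordinates $(\sqrt{p_i}, \sqrt{1 - p_i})$, the quantum power curve is taken in
the piecewise form known from quantum detection theory for (25), and the
classical ROC curve is taken as the polygonal curve through $(0, 0)$,
$(p_0, p_1)$ and $(1, 1)$ obtained by randomizing the classical test. The
function names and the values $p_0 = 0.2$, $p_1 = 0.6$ are hypothetical.

```python
import math


def overlap_sq(p0, p1):
    """Assumed |X|^2: squared inner product of the two density vectors
    whose coordinates are (sqrt(p_i), sqrt(1 - p_i))."""
    return min(1.0, (math.sqrt(p0 * p1) + math.sqrt((1 - p0) * (1 - p1))) ** 2)


def quantum_power(q0, x2):
    """Power Q_d at size Q_0 = q0, using the piecewise form assumed for (25)."""
    if q0 >= x2:
        return 1.0
    return (math.sqrt(q0 * x2) + math.sqrt((1 - q0) * (1 - x2))) ** 2


def classical_power(x, p0, p1):
    """Assumed classical ROC: polygon through (0, 0), (p0, p1) and (1, 1)."""
    if x <= p0:
        return p1 * x / p0
    return p1 + (1 - p1) * (x - p0) / (1 - p0)


def min_gap(p0, p1, steps=1000):
    """Smallest value of Q_d(x) - P_d(x) over a uniform grid of sizes x."""
    x2 = overlap_sq(p0, p1)
    grid = [k / steps for k in range(steps + 1)]
    return min(quantum_power(x, x2) - classical_power(x, p0, p1) for x in grid)


if __name__ == "__main__":
    # Hypothetical estimates p0 = 0.2 and p1 = 0.6.
    print(min_gap(0.2, 0.6))
```

Under these assumptions the gap should never be negative on the grid, up to
floating-point rounding, and it should vanish at the size $p_0$, where the
classical detector is a point of the quantum curve.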

Example 1 Suppose that ten information units have been used for training
a data management system. Each unit has been indexed using one binary
feature and has been marked as useful (1) or useless (0). The training set is
summarized by Table 1.

unit      u1   u2   u3   u4   u5   u6   u7   u8   u9   u10
feature    1    1    1    1    1    0    0    0    0    0
use        1    1    1    0    0    1    0    0    0    0

Table 1: The training set of Example 1.

Therefore, 

Pi = l po = l |Xp = 0.91 = 0.09 . 

5 5 

The computation of $P_d, P_0$ follows from (13) and the computation of
$Q_d, Q_0$ follows from (23). When $P_0 = Q_0$, we have that $Q_d \ge P_d$.

5 Interpretation of the Optimal Projectors in Data Management

In this section, some interpretations of the optimal projectors representing
the region of acceptance are provided. The optimal projectors $Q_0, Q_1$ in the
pure case have a more difficult interpretation than $P_0, P_1$ because the
latter represent "physical" observations (e.g., a customer review does exist
or does not) whereas $Q_0, Q_1$ cannot be expressed in terms of $P_0, P_1$ and
we cannot explain the $Q_i$'s by saying that, for example, they represent the
presence and/or the absence of a feature. In quantum theory, the
impossibility of expressing a projector as a function of other projectors is
termed incompatibility, which is expressed mathematically as
$Q_i P_j \neq P_j Q_i$, $P_j \neq Q_j$.

The interpretation of the optimal projectors reflects on the interpretation
of what it means that they are "observed" in an information unit; for
example, if the information unit is a commercial item suggested to a
customer, what does the "observation" of $Q_1$ mean? What should we observe
from an information unit so that the observation outcome corresponds to the
projector? The question is not futile because the answer(s) would affect the
algorithms (e.g., automatic indexing) used for representing the informative
content of the unit. Specifically, the interpretation of $Q_0, Q_1$ determines
what the retrieval algorithm must do when a feature is observed. Whether the
interpretation of an optimal projector is implemented at indexing time or at
query time, an internal memory representation in terms of data structures is
necessary for automatic processing, and the representation needs the
observation of physical properties which are then converted into numbers. In
the mixed case, the answer is quite straightforward because the optimal
projectors correspond to the feature occurrence and separate the units
indexed by the feature from those not indexed. In the pure case, the answer
is not straightforward at all. If $P_0, P_1$ represent feature occurrence, the
$Q$'s cannot be a feature occurrence; they are something new which cannot be
described in a conventional way.

Geometrically, each vector is a superposition of two other independent
vectors. Figure 3 depicts the way the vectors and the spanned subspaces
(i.e., projectors) interact and shows that $|\eta_1\rangle, |\eta_0\rangle$ are
placed symmetrically "around" the density vectors and the probability of
error is minimized [11]. The observation of a binary feature places the
observer upon either $|0\rangle$ or $|1\rangle$ and there is no way to move
upon $|\eta_0\rangle$ or $|\eta_1\rangle$.
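
Probabilistically, the optimal projectors and the density vectors are
related as follows:

$$|\varphi_0\rangle = \sqrt{1 - Q_0}\,|\eta_0\rangle + \sqrt{Q_0}\,|\eta_1\rangle \qquad (26)$$

$$|\varphi_1\rangle = \sqrt{1 - Q_d}\,|\eta_0\rangle + \sqrt{Q_d}\,|\eta_1\rangle \qquad (27)$$

where

$$Q_0 = |\langle\eta_1|0\rangle\langle 0|\varphi_0\rangle + \langle\eta_1|1\rangle\langle 1|\varphi_0\rangle|^2 \qquad (28)$$

$$Q_d = |\langle\eta_1|0\rangle\langle 0|\varphi_1\rangle + \langle\eta_1|1\rangle\langle 1|\varphi_1\rangle|^2 \qquad (29)$$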
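
The relationship between the optimal vector and the two probabilities can be
checked numerically. The sketch below is a minimal illustration under stated
assumptions: the density vectors have coordinates
$(\sqrt{p_i}, \sqrt{1 - p_i})$ in the basis of the binary feature,
$\lambda = 1$, and $|\eta_1\rangle$ is taken as the eigenvector of
$\rho_1 - \rho_0$ with the positive eigenvalue, as in quantum detection theory.
The function names and the values $p_0 = 0.2$, $p_1 = 0.6$ are hypothetical.

```python
import math


def density_vector(p):
    """Coordinates of a density vector in the canonical (feature) basis."""
    return (math.sqrt(p), math.sqrt(1.0 - p))


def optimal_vector(phi0, phi1):
    """Unit eigenvector of rho_1 - rho_0 with the positive eigenvalue.

    For real unit vectors the difference is a symmetric, traceless 2x2 matrix
    [[a, b], [b, -a]]; its positive-eigenvalue eigenvector is
    (cos(t / 2), sin(t / 2)) with t = atan2(b, a).
    """
    a = phi1[0] ** 2 - phi0[0] ** 2
    b = phi1[0] * phi1[1] - phi0[0] * phi0[1]
    t = math.atan2(b, a)
    return (math.cos(t / 2), math.sin(t / 2))


def probability(eta1, phi):
    """Squared modulus of the amplitude of phi on eta1, expanded over the
    canonical basis: |<eta1|e_0><e_0|phi> + <eta1|e_1><e_1|phi>|^2."""
    amplitude = eta1[0] * phi[0] + eta1[1] * phi[1]
    return amplitude ** 2


if __name__ == "__main__":
    phi0, phi1 = density_vector(0.2), density_vector(0.6)
    eta1 = optimal_vector(phi0, phi1)
    q0 = probability(eta1, phi0)  # size (probability of false alarm)
    qd = probability(eta1, phi1)  # power (probability of detection)
    print(q0, qd)
```

With $\lambda = 1$ the two values should sum to one, and the power should
exceed the classical estimate $p_1$ used to build the density vector.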



Logically, the projectors are assertions, thus a parallel can be established
between assertions and subsets: an assertion defines the elements of the
universe (e.g., an event space) which belong to a subset. The basic
difference between subspaces and subsets is that the vectors belong to a
subspace if and only if they are spanned by a basis of the subspace. A
containment relationship can be established between subspaces such that if a
subspace (e.g., a line) includes a point, then every subspace (e.g., a plane)
containing the line includes the point too. The subspace spanned by the
projector $A$ is termed $L(A)$ and the containment relationship between
$L(A)$ and $L(B)$ can be encoded as $AB = B$ such that for every vector
$|x\rangle$, $AB|x\rangle = B|x\rangle$ [22, Ch. 5].
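
The containment test can be illustrated with a small Python sketch. The
projectors below (a plane and two lines in a real three-dimensional space)
and the function names are hypothetical and only serve to show how
$AB = B$ encodes that $L(B)$ is included in $L(A)$.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def contains(a, b, tol=1e-12):
    """True when L(B) is contained in L(A), tested entrywise as AB = B."""
    ab = matmul(a, b)
    n = len(b)
    return all(abs(ab[i][j] - b[i][j]) <= tol
               for i in range(n) for j in range(n))


# Projector onto the plane spanned by the first two coordinate axes.
PLANE = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
# Projector onto the line spanned by (1, 1, 0) / sqrt(2), which lies in the plane.
LINE_IN_PLANE = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 0.0]]
# Projector onto the third coordinate axis, which is orthogonal to the plane.
LINE_OUTSIDE = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    print(contains(PLANE, LINE_IN_PLANE))  # line inside the plane
    print(contains(PLANE, LINE_OUTSIDE))   # line outside the plane
```

The relationship is not symmetric: the plane contains the first line, but the
line does not contain the plane.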

The paper considers the information units relevant, useful or interesting
when they are included by the subspace $L(|\varphi_1\rangle)$ spanned by
$|\varphi_1\rangle$. Suppose that a subspace $L(A)$ is given and that
$|\langle x - y|x - y\rangle|^2$ is the metric defined on the whole space to
which $L(A)$ belongs.

Proposition 1 $|y^*\rangle = \arg\min_{|y\rangle \in L(A)} |\langle x - y|x - y\rangle|^2 = A|x\rangle$

Proof See [11, 12].

As $|\langle x - y^*|x - y^*\rangle|^2 = \mathrm{tr}(|x - y^*\rangle\langle x - y^*|) = 1 - \mathrm{tr}(|x\rangle\langle y^*|)$,
$|y^*\rangle$ maximizes the probability

$$\mathrm{tr}(|x\rangle\langle y^*|) \qquad (30)$$

Note that when $|x\rangle$ is in $L(|\varphi_0\rangle)$, (30) is the distance
between the $L(|\varphi_0\rangle)$'s defined in [23].

The result establishes a connection between the geometric, probabilistic
and logical interpretations of the projectors even though they seem
different. In the next section, these interpretations are tied together,
thus allowing us to look for a criterion for assigning an information unit
to the best region.
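
Proposition 1 can be illustrated numerically. The following minimal sketch
assumes real vectors and a hypothetical projector $A$ onto a plane of a
three-dimensional space; it compares $A|x\rangle$ with other points of $L(A)$
taken on a coarse grid, using the plain squared distance
$\langle x - y|x - y\rangle$. All names and values are illustrative.

```python
def project(projector, x):
    """Apply a projector, given as a list of rows, to a real vector."""
    return [sum(p * c for p, c in zip(row, x)) for row in projector]


def squared_distance(x, y):
    """<x - y|x - y> for real vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical example: A projects onto the plane of the first two axes.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
x = [0.6, 0.0, 0.8]       # unit vector outside the plane
y_star = project(A, x)    # candidate minimizer A|x>

# Other points of L(A) on a coarse grid, for comparison.
candidates = [[u / 10, v / 10, 0.0]
              for u in range(-10, 11) for v in range(-10, 11)]
best = min(candidates, key=lambda y: squared_distance(x, y))

if __name__ == "__main__":
    print(best, y_star)
```

In this example the closest grid point coincides with $A|x\rangle$, and its
squared distance from $|x\rangle$ equals one minus the inner product between
$|x\rangle$ and $A|x\rangle$.
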

6 Implementation of the Optimal Ranking

The problem is to decide whether an information unit represented by a
binary feature¹ is considered relevant, useful, interesting, etc. An
algorithm implementing such a decision rule should perform as follows. It
reads the feature occurrence symbol (i.e., either 0 or 1) and checks whether
the feature is included by the region of acceptance. If the feature is not
included, the hypothesis of relevance, usefulness, interest, etc. is rejected.

Another view of the preceding decision rule is the ranking of the
information units. When ranking information units, the system returns the
units whose features lead to the highest probability of detection, then those
whose features lead to the second highest probability of detection, and so
on. When a binary feature is considered, the ranking ends up retrieving the
units whose features lead to the highest probability of detection, whereas
the other units are not retrieved.
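
The decision rule and the induced ranking described above can be sketched as
follows. The sketch is only illustrative: the unit identifiers, the
acceptance region and the detection probabilities attached to the two
feature symbols are hypothetical values, not results of the paper.

```python
def decide(feature, acceptance_region):
    """Accept the hypothesis (True) when the observed symbol is in the region."""
    return feature in acceptance_region


def retrieve(units, acceptance_region):
    """Identifiers of the units whose feature falls in the acceptance region."""
    return [u for u, f in units.items() if decide(f, acceptance_region)]


def rank(units, detection_probability):
    """Identifiers ordered by decreasing probability of detection of their
    feature; units with the same feature keep their original order."""
    return sorted(units, key=lambda u: detection_probability[units[u]],
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical collection: unit identifier -> binary feature symbol.
    units = {"u1": 1, "u2": 0, "u3": 1, "u4": 0}
    # Hypothetical probabilities of detection associated with each symbol.
    detection_probability = {1: 0.6, 0: 0.4}
    print(rank(units, detection_probability))
    print(retrieve(units, {1}))
```

With a single binary feature there are only two distinct scores, so ranking
reduces to separating the units inside the region of acceptance from the
others.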

As we point out in Section 5, the observation of features corresponding
to $P_1, P_0$ cannot give any information about the observation of the events
corresponding to $Q_1, Q_0$ due to the incompatibility between these pairs of
events. Thus, we cannot design an algorithm implementing the decision rule
so that the observation of a feature can be translated into the observation of
the events corresponding to $Q_0, Q_1$.

A possible approach can be based on the probabilistic interpretation of the
optimal projectors. According to such an approach, the probability that the
event corresponding to an optimal projector occurs provided that a feature
occurs can provide a measure of the degree to which the event would have
occurred if it could have been observed. When the subspaces represent these
events, the probability that the event corresponding to an optimal projector
is observed if the state is described by $\rho_1$ is $P_d$. Such an approach
is only partially satisfactory because $P_d \neq Q_d$.

As an alternative approach, consider the geometric interpretation depicted
in Figure 3. Note that the asymmetry of $|0\rangle, |1\rangle$ with respect to
the densities causes the suboptimality of $P_1 = |1\rangle\langle 1|$,
$P_0 = |0\rangle\langle 0|$. Indeed, if $|0\rangle, |1\rangle$ coincided with
$|\eta_0\rangle, |\eta_1\rangle$, the power and the size would be the same in
both cases.

We propose a method which reaches the optimality without either resorting
to probabilistic approximations or incurring high computational costs.

¹ We recall that the binary case has been introduced for the sake of clarity.

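The density vectors $|\varphi_1\rangle, |\varphi_0\rangle$ are a superposition
of both $|0\rangle, |1\rangle$ and $|\eta_0\rangle, |\eta_1\rangle$, which are
two different bases and induce different coordinates. When
$|0\rangle, |1\rangle$ is the basis, the coordinates of $|\varphi_i\rangle$
are $\sqrt{p_i}, \sqrt{1 - p_i}$. When $|\eta_0\rangle, |\eta_1\rangle$ is the
basis, the coordinates of $|\varphi_i\rangle$ are
$x_{00}, x_{01}, x_{10}, x_{11}$ such that

$$|\varphi_0\rangle = x_{00}|\eta_0\rangle + x_{01}|\eta_1\rangle \qquad |\varphi_1\rangle = x_{10}|\eta_0\rangle + x_{11}|\eta_1\rangle \qquad (31)$$

and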



|X| 


2 




(l-r^i)2 


+ 


|X|2 


|X| 


2 






+ 


|X|2 



+ |X|2 

2 _ l-l 2 _ i^+Vlf fnn^ 

^10- , ,..,2 ^11- (1+^^)2 + |X|2 ^"^"^^ 

As $\lambda = 1$ is often assumed when ranking information units, the
coordinates have a quite simple and intuitive meaning provided by the
following expressions:

$$x_{00}^2 = \frac{1 + d}{2} \qquad x_{01}^2 = \frac{1 - d}{2} \qquad (34)$$

$$x_{10}^2 = \frac{1 - d}{2} \qquad x_{11}^2 = \frac{1 + d}{2} \qquad (35)$$

where $1 - d^2 = |X|^2$.

As the asymmetry of $|0\rangle, |1\rangle$ with respect to the density vectors
is due to $p_1, p_0$, which summarize the statistics observed from the
training set, we leverage them for improving the ranking. In particular, in
the paper we show that changing the estimation of $p_1, p_0$ is sufficient to
reach the optimality.

Then, we wonder how we should define the density vectors or matrices so
that $Q_d, Q_0$ are obtained instead of $P_d, P_0$. The basis vectors (i.e.,
$|0\rangle, |1\rangle$ or $|\eta_0\rangle, |\eta_1\rangle$) are rotated, thus
changing the coordinates.

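Therefore, we define the density vectors
$|\varphi'_0\rangle, |\varphi'_1\rangle$ in $|0\rangle, |1\rangle$ according
to (31) in such a way that if a feature is observed under $H_1$, the
probability of detection is $Q_d$ and, if a feature is observed under $H_0$,
the probability of false alarm is $Q_0$. The simple solution is defining the
new density vectors as follows:

$$|\varphi'_0\rangle = x_{00}|0\rangle + x_{01}|1\rangle \qquad |\varphi'_1\rangle = x_{10}|0\rangle + x_{11}|1\rangle \qquad (36)$$

thus obtaining

$$Q'_0 = \mathrm{tr}(|\varphi'_0\rangle\langle\varphi'_0| P_1) = Q_0 \qquad Q'_d = \mathrm{tr}(|\varphi'_1\rangle\langle\varphi'_1| P_1) = Q_d \qquad (37)$$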
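
The re-definition of the density vectors can be sketched in Python as
follows. This is only an illustration under explicit assumptions:
$\lambda = 1$, the squared coordinates are taken as
$x_{00}^2 = x_{11}^2 = (1 + d)/2$ and $x_{01}^2 = x_{10}^2 = (1 - d)/2$ with
$d^2 = 1 - |X|^2$, and $|X|^2$ is the squared inner product of the original
density vectors with coordinates $(\sqrt{p_i}, \sqrt{1 - p_i})$. The function
names and the values $p_0 = 0.2$, $p_1 = 0.6$ are hypothetical.

```python
import math


def overlap_sq(p0, p1):
    """Assumed |X|^2 for density vectors with coordinates
    (sqrt(p_i), sqrt(1 - p_i))."""
    return min(1.0, (math.sqrt(p0 * p1) + math.sqrt((1 - p0) * (1 - p1))) ** 2)


def reweighted_coordinates(p0, p1):
    """Squared coordinates x_ij^2 assumed for lambda = 1, with d^2 = 1 - |X|^2."""
    d = math.sqrt(1.0 - overlap_sq(p0, p1))
    return {"x00": (1 + d) / 2, "x01": (1 - d) / 2,
            "x10": (1 - d) / 2, "x11": (1 + d) / 2}


def new_density_vectors(p0, p1):
    """New density vectors, stored as pairs
    (coordinate on |0>, coordinate on |1>)."""
    c = reweighted_coordinates(p0, p1)
    phi0 = (math.sqrt(c["x00"]), math.sqrt(c["x01"]))
    phi1 = (math.sqrt(c["x10"]), math.sqrt(c["x11"]))
    return phi0, phi1


def prob_feature_occurs(phi):
    """tr(|phi><phi| P_1) with P_1 = |1><1|: the squared coordinate on |1>."""
    return phi[1] ** 2


if __name__ == "__main__":
    phi0, phi1 = new_density_vectors(0.2, 0.6)
    # Size and power obtained by observing the ordinary binary feature
    # on the re-weighted density vectors.
    print(prob_feature_occurs(phi0), prob_feature_occurs(phi1))
```

The point of the sketch is that the observable stays the same (the occurrence
of the binary feature) while the estimates $p_0, p_1$ are replaced by the new
squared coordinates, so that the power obtained is higher than $p_1$.
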

                          Second Eigenvalue
First Eigenvalue          $1 - p_1 < \lambda(1 - p_0)$    $1 - p_1 > \lambda(1 - p_0)$
$p_1 < \lambda p_0$       $0$                             $P_0$
$p_1 > \lambda p_0$       $P_1$                           $I$

Table 2: The regions of acceptance corresponding to the sign of the
eigenvalues of the spectrum of the discriminant function in the mixed case.
The equality case is addressed in [11].





                                  Second Eigenvalue
First Eigenvalue                  $1 - |x_{11}|^2 < \lambda(1 - |x_{01}|^2)$    $1 - |x_{11}|^2 > \lambda(1 - |x_{01}|^2)$
$|x_{11}|^2 < \lambda|x_{01}|^2$  $0$                                           $Q_0$
$|x_{11}|^2 > \lambda|x_{01}|^2$  $Q_1$                                         $I$

Table 3: The regions of acceptance corresponding to the sign of the
eigenvalues of the spectrum of the discriminant function in the pure case.
The equality case is addressed in [11].



At first sight, the increase of the probability of detection is due to the 
higher probability values assigned to the region of acceptance in the pure 
case than those assigned to the region of acceptance in the mixed case, and 
not to a different ranking. In the following, we show that the superiority 
of the discriminant function in the pure case is due to the different ranking 
induced by a different partition of the event space into region of acceptance 
and region of rejection. 

We state the problem as follows. Are there $p_0, p_1, \lambda$ such that the
region of acceptance in the pure case differs from that in the mixed case?
Consider Theorem 3 to answer the question. The region of acceptance in the
mixed case is defined through Table 2 whereas the region of acceptance in the
pure case is defined through Table 3. Furthermore, the discriminant function
derived from (36) is

$$\mathrm{tr}((\sigma_1 - \lambda\sigma_0)E) \qquad E \in \{0, Q_0, Q_1, I\} \qquad (38)$$

where

$$\sigma_i = \begin{pmatrix} |x_{i1}|^2 & \sqrt{|x_{i1}|^2(1 - |x_{i1}|^2)} \\ \sqrt{|x_{i1}|^2(1 - |x_{i1}|^2)} & 1 - |x_{i1}|^2 \end{pmatrix} \qquad i = 0, 1 \qquad (39)$$

Suppose that po = = = |. Thus, = ^,R = 0.21, xj^ ^ 
0.81 , Xqi = 0.11 and the region of acceptance in the mixed case is represented 
by I, whereas the region of acceptance in the pure case is represented by Pi. 
The counter-example just mentioned proves the following 

Corollary 1 The discriminant function (38) ranks information units in a
different way from the discriminant function (14) because an alternative
ranking is computed.

In (14) the densities that are considered are those associated with a mixed
state, while in (38) the densities are those associated with a pure state.
The equations therefore look the same; what differs is the type of densities
used in the two cases. We have shown that the improvement of ranking
measured in terms of probability of detection given a probability of false
alarm is due to the ranking induced by $Q'_0, Q'_d$.

Hence, we restate the problem of finding $Q_0, Q_1$ as the problem of defining
the coordinates of the representation of the density vectors in
$|0\rangle, |1\rangle$. The problem of defining the new coordinates for the
density vectors might be viewed as a problem of feature weighting. In such a
context, the traditional estimations of the probability of feature
occurrence (i.e., $p_1, p_0$) under two different hypotheses are replaced by
the $x_{ij}$'s. Feature re-weighting is explored in IR, whose state of the art
is given by the BM25 weighting scheme surveyed, for example, in [21]. The
main drawback of weighting schemes like BM25 is the parameter tuning
necessary for optimizing the effectiveness with a given database or query,
which makes it rather problematic to understand how and why a scheme is more
effective than others.

In contrast, the paper illustrates the decision rule in such a way that, if
the decision rule is followed, $H_1$ shall be accepted when it is true at a
higher probability of detection than that of the classical rule when the
probability of false alarm is not more than a given threshold. The
formulation of the decision rule provided in this section allows us to design
an efficient algorithm for indexing and retrieving (or classifying)
information units. The algorithm is just an instance of those employed
currently in IR (see [7] for example) where the maximum likelihood or
Bayesian estimations of $p_0, p_1$ are replaced by (32) and (33).

7 Related Work

The foundations of quantum mechanics and theory have been illustrated in
plenty of books such as [9] and [12]. Quantum probability, for example, has
been introduced in [17]. In particular, the interference term is addressed
in [1]. The view of probability illustrated in Section 2 is based on [19]. The
utilization of quantum theory in computation, information processing and
communication is described in [16]. Recently, investigations have started in
other research areas, for example, in IR [5].

The paper is inspired by Helstrom's book [11], which provides the
foundations and the main results in quantum detection; an example of the
exploitation of the results in quantum detection is reported in [B] within
communication theory. This paper links to [22] as far as density matrices
and projectors are concerned; however, the paper develops quantum detection
for data management.

This paper departs from the Probability Ranking Principle (PRP) proposed
in the context of classical probability; we propose quantum probability to
improve ranking in a principled way. In Information Retrieval, the PRP
states that "If a reference retrieval system's response to each request is a
ranking of the documents in the collection in order of decreasing
probability of relevance to the user who submitted the request, where the
probabilities are estimated as accurately as possible on the basis of
whatever data have been made available to the system for this purpose, the
overall effectiveness of the system to its user will be the best that is
obtainable on the basis of those data." [20]. However, some assumptions
undermine the general applicability of the PRP. We state a similar principle
yet replace classical probability, which is implied in [20], with quantum
probability: parameter estimation data are kept the same, but we instead
use subspaces to define alternative regions of acceptance and rejection.

To our knowledge, the use of quantum probability for ranking information
units has not yet been addressed in the same way as in this paper, although
a few somewhat comparable papers can be found. Perhaps, the
closest paper is [2S|. That paper proposes to rank documents by quantum 
probability and suggests that interference (which must be estimated) might
model dependencies in relevance judgements such that the documents ranked
until position $n - 1$ interfere with the degree of relevance of the document
ranked at position $n$. This means that the optimal order of documents under
the PRP differs from that of the Quantum PRP. Note that they empirically show
that quantum probability is more effective than the classical one in specific
ranking tasks.

In contrast, in this paper, we do not need to address interference because
quantum probability can be estimated using the same data used to estimate
classical probability. We rather show that ranking by quantum probability
not only provides a different optimal ranking, but is also more effective
than ranking by classical probability. In this regard, the effectiveness of
quantum probability measured in that paper stems from the estimation of
classical probability and that of interference. However, the regions of
acceptance and rejection are still based on sets. It follows that the
optimality of the Quantum PRP strongly depends on the optimality of the PRP
and on the effectiveness of the interference estimation. In this paper, on
the contrary, ranking optimality only depends on the region of acceptance
defined upon subspaces.

Another paper somewhat related to ours is [18]. The authors discuss
how to employ quantum formalisms for encompassing various Information
Retrieval tasks within a single framework. From an experimental point of
view, what that paper demonstrates is that ranking functions based on
quantum formalism are computationally feasible. The best experimental
results of rankings driven by quantum formalism are comparable to BM25, that
is, to PRP, thus limiting the contribution within a classical probability
framework.

Probabilistic database systems manage imprecise data and provide tools
for structured complex queries. A survey is provided in [8]. Besides
scalability and query plan execution, these systems do probabilistic
inference, which may be defined upon classical or quantum probability, and
they concentrate on top-$k$ query answering where the tuples are assigned a
probability distribution. The results of this paper may be applied to
probabilistic database systems too.

8 Future Developments and Conclusions 

The main result of the paper is the demonstration that quantum probability 
can be incorporated into a data management system for ranking information 
units. As ranking by quantum probability is more effective than ranking by 
classical probability when it has been used in other domains, it is our belief 
that an analogous improvement can be achieved within data management. 

The future developments are threefold. First, we will work on the
interpretation of the optimal projectors in the pure case because the
detection of them in an information unit may provide further insights.
Second, classical feature correlation and quantum entanglement will be
investigated. Third, evaluation is crucial to understand whether the results
of the paper can be confirmed by the experiments.

A Dirac Notation

A complex vector $x$ is represented as $|x\rangle$ and is called "ket". The
conjugate transpose of $x$ is represented as $\langle x|$ and is called "bra"
(therefore, the Dirac notation is called the bra(c)ket notation).

The inner product between $x$ and $y$ is represented as
$\langle x|y\rangle$, which is a complex number. The inner product between $x$
and itself is $\langle x|x\rangle$.

The outer product (or dyad) is $|x\rangle\langle y|$. A special case of dyad
is $|x\rangle\langle x|$, which is the projector onto $x$.

If $A$ is a matrix (or an operator), then $A|x\rangle$ is the vector resulting
from the linear transformation represented by $A$. $|x\rangle\langle x|$ is
also an operator because it is a projector.

The real number $|\langle x|y\rangle|^2$ is the squared modulus of the inner
product between $x$ and $y$.

Moreover, $\langle x|A|y\rangle = \mathrm{tr}(\langle x|A|y\rangle)$ and the
properties of trace allow us to write
$\mathrm{tr}(\langle x|A|y\rangle) = \mathrm{tr}(A|y\rangle\langle x|)$. In
general, if $A, B$ are trace-1 Hermitian operators, and $B$ is a projector,
$0 \le \mathrm{tr}(AB) \le 1$ is the probability that the event represented by
$B$ occurs given a density operator $A$.

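The Dirac notation allows us to write complex expressions in an elegant
way, for example,

$$\langle x|y\rangle\langle y|z\rangle\langle z|x\rangle = \mathrm{tr}(\langle x|y\rangle\langle y|z\rangle\langle z|x\rangle)$$
$$= \mathrm{tr}(|x\rangle\langle x|\,|y\rangle\langle y|\,|z\rangle\langle z|)$$
$$= \mathrm{tr}(\langle z|x\rangle\langle x|y\rangle\langle y|z\rangle)$$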
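
The identities above can be verified numerically with a short Python sketch.
Kets are stored as lists of complex numbers and operators as lists of rows;
the helper names and the three example vectors are hypothetical.

```python
def bra(x):
    """Conjugate transpose of a ket given as a list of complex numbers."""
    return [c.conjugate() for c in x]


def inner(x, y):
    """Inner product <x|y>."""
    return sum(a * b for a, b in zip(bra(x), y))


def outer(x, y):
    """Outer product |x><y| as a list of rows."""
    return [[a * b for b in bra(y)] for a in x]


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a):
    """Sum of the diagonal entries."""
    return sum(a[i][i] for i in range(len(a)))


if __name__ == "__main__":
    s = 2 ** -0.5
    x = [1 + 0j, 0j]
    y = [s + 0j, s * 1j]
    z = [0.6 + 0j, 0.8 + 0j]
    lhs = inner(x, y) * inner(y, z) * inner(z, x)
    rhs = trace(matmul(matmul(outer(x, x), outer(y, y)), outer(z, z)))
    print(lhs, rhs)  # the two values should agree up to rounding
```

The same helpers give the probability of the event $|y\rangle\langle y|$ in
the state $|x\rangle\langle x|$ as the trace of the product of the two
projectors, which equals $|\langle x|y\rangle|^2$.
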
When $A$ is mixed and $E$ is a projector,

$$\mathrm{tr}(AE) = \mathrm{tr}\left(\sum_{i=0}^{r-1} a_i E_i E\right) = \sum_{i=0}^{r-1} a_i\, \mathrm{tr}(E_i E)$$

Figure 2: A graphical representation of the ROC curves. $Q_d$ is the curve
above the polygonal curve depicting $P_d$. The classical probability ROC curve
intercepts the quantum probability ROC curve at $(p_0, p_1)$,
$(1 - p_0, 1 - p_1)$ and $(1 - p_0, 1 - p_1)$, $(1, 1)$ where $P_1, P_0$ are
observed. $Q_d = 1$ for $Q_0 > |X|^2$ and $Q_d \ge P_d$ for all $Q_0$'s.

Figure 3: The geometric view of the canonical vectors in the mixed case, the
optimal vectors in the pure case and the vectors representing the densities.
The Greek letters $\alpha, \beta, \gamma$ indicate the angles between the vectors.